In this project our efforts are focused on the analysis of data included in the csv file provided, to primarily understand the potential relationship between different parameters and the incidences of cancer across counties in the US. The main objectives are: >1. To understand factors that predict cancer mortality rate, with the ultimate aim of identifying communities for social interventions. >2. To determine which interventions are likely to have the most impact.
Cancer <- read.csv('cancer.csv', row.names = 1)
source('load.R')
#summary(Cancer) #summary statistics
#str(Cancer, max.level = 1, strict.width = "wrap")
colnames(Cancer)
[1] "avgAnnCount" "medIncome" "popEst2015"
[4] "povertyPercent" "binnedInc" "MedianAge"
[7] "MedianAgeMale" "MedianAgeFemale" "Geography"
[10] "AvgHouseholdSize" "PercentMarried" "PctNoHS18_24"
[13] "PctHS18_24" "PctSomeCol18_24" "PctBachDeg18_24"
[16] "PctHS25_Over" "PctBachDeg25_Over" "PctEmployed16_Over"
[19] "PctUnemployed16_Over" "PctPrivateCoverage" "PctEmpPrivCoverage"
[22] "PctPublicCoverage" "PctWhite" "PctBlack"
[25] "PctAsian" "PctOtherRace" "PctMarriedHouseholds"
[28] "BirthRate" "deathRate" "color"
[31] "incidenceRate"
cat(" \n")
print(paste0('Number of rows: ', nrow(Cancer)))
[1] "Number of rows: 3047"
print(paste0('Number of columns: ', ncol(Cancer)))
[1] "Number of columns: 31"
The cancer.csv file contains 29 variables (30 columns, including the first one that only contains the row numbers) and 3,047 observations. Each observation (i.e. row) includes data for a county across the US. The variables are mostly numbers and integers, except for 2 that are factors (binnedInc and Geography). Below, we explain the variables in detail and provide our assessment of the quality of the data.
data on smoking and obesity and other cancer risk factors could’ve been very helpful
avgAnnCount: The average number of new cancer cases per year per county for years 2009-2013popEst2015: Estimated population by county 2015medIncome: Median income per countypovertyPercent: Percent of population below poverty linebinnedInc: ???MedianAge: Median age per countyMedianAgeMale: Median age among males per countyMedianAgeFemale: Median age among females per countyGeography: County, State namesPercentMarried: Percentage of married populationPctMarriedHouseholds: Percentage of married households per countyPctNoHS18_24: Percentage of 18-24 year old population with no high school educationPctHS18_24: Percentage of 18-24 year old population with high school educationPctSomeCol18_24: Percentage of 18-24 year old population with some college educationPctBachDeg18_24: Percentage of 18-24 year old population with bachelor’s degreePctHS25_Over: Percentage of population above 25 years old with high school educationPctBachDeg25_Over: Percentage of population above 25 years old with bachelor’s degreeAvgHouseholdSize: Average household size per countyPctEmployed16_Over: Percentage of population above 16 years old who have jobsPctUnemployed16_Over: Percentage of population above 16 years old with no jobsPctPrivateCoverage: Percentage of the population with private insurance coveragePctEmpPrivCoverage: percentage of the population with employer-sponsored insurance coveragePctPublicCoverage: Percentage of the population with public insurance coveragePctWhite: Percentage of white population by countyPctBlack: Percentage of African-American population by countyPctAsian: Percentage of Asian population by countyPctOtherRace: Percentage of other races by countyBirthRate: Birth rate per countydeathRate: Death rate per countyBased on the outputs from diagnostic and summary statistics functions, as well as further univariate analysis, using relevant charts, below we describe our evaluation of the dataset and its variables. Since definition of most variables was not provided to us, our first step was to ensure understanding of what exactly such variables represent. We also evaluated the data to identify potentially erroneous values, extreme outliers and variables that might require transformation.
from the assignment document: Evaluate the data quality. Are there any issues with the data? Explain how you handled these potential issues. Explain whether any data processing or preparation is required for your data set.
Data time frame:: While avgAnnCount represents statistics for 2009-2013, the population by county is for 2015 and other variables do not have date stamps. Ideally all variables should have been from the same time period.
avgAnnCount definition:: There is no clear definition for incidence rate per county for the avgAnnCount variable. Since the sum of all values is 1,847,514 and that based on Cancer.gov) data the average number of cases for 2009-2013 is 1,617,144, we assume this variable represents the actual count of new cases. Therefore, in our analysis we created a new variable called incidenceRate to represent the incidence rate of cancer per 100,000 people per county to be able to compare the spread of new cancer cases in different geographical regions regarldess of the actual population of such regions.
#caclulating the total for avgAnnCount to compare with offical reports by Cancer.gov
sum(Cancer$avgAnnCount)
[1] 1504941
Official Cancer Statistics, 2009-2013
Source: [Cancer.gov](https://www.cancer.gov/)
| Year | New Cases | Deaths |
|---|---|---|
| 2009 | 1,660,290 | 562,340 |
| 2010 | 1,529,560 | 569,490 |
| 2011 | 1,596,670 | 571,950 |
| 2012 | 1,638,910 | 577,190 |
| 2013 | 1,660,290 | 580,350 |
#calculating the mean of the number of new cancer cases for years 2009-2013, based on Cancer.gov data, in order to cofirm our assumption regarding avgAnnCount
incidence_cancer <- c(1660290, 1529560, 1596670, 1638910, 1660290)
mean(incidence_cancer)
[1] 1617144
avgAnnCount: Through our assessment, we noticed that the number of new cancer cases (avgAnnCount) for 6 counties were greater than those counties’ populations (popEst2015). Looking at the 6 observations, we realized that the value assigned to popEst2015 for all these 6 counties is exactly the same number (1962.667684). In fact there are a total of 206 counties that have exactly the same average number of new cases, which is probably an error in the dataset. We decided to replace all of them with NA in our analysis.#cheking the number of observatios, where new case count is greater than the population
sum(Cancer$avgAnnCount > Cancer$popEst2015, na.rm = TRUE)
[1] 0
Cancer$avgAnnCount[Cancer$avgAnnCount == 1962.667684] <- NA #removing the potentially erroneous number
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000 #creating a new variable: new cases per 100,00 people per county
Geography: We checked this variable to identify potential duplicates. Since the number of unique values in this column (3,047) is equal to the total number of observations, there can not be any duplicates in this column.#checking for potential dubplicates in this variable
length(unique(Cancer[["Geography"]]))
[1] 3047
binnedInc: This variable has 10 levels that seem arbitrary and have different bin sizes. It is not clear why the income bins have been defined this way. As a result, we decided to ignore it in our analysis.
Anomaly in MedianAge: The maximum MedianAge shows a value of 624, which is clearly a wrong number. We actually identified a total of 30 values in this column that are above 100; therefore, we will replace such values with NA in our analysis.
age_error = subset(Cancer, MedianAge > 100) #checking the number of erroneous values
nrow(age_error)
[1] 0
Cancer$MedianAge[Cancer$MedianAge > 100] = NA #replacing erroneous values with NA
AvgHouseholdSize: The minimum for AvgHouseholdSize is 0.0221, which does not make sense, since we do not expect a household size below 1. There are 61 values in this column that are below 1, which we will replace with NA in our analysis.household_error = subset(Cancer, AvgHouseholdSize < 1) #checking the number of erroneous values
nrow(household_error)
[1] 0
Cancer$AvgHouseholdSize[Cancer$AvgHouseholdSize < 1] = NA #replacing erroneous values with NA
PctSomeCol18_24: 75% of values within this variable are NAs (2285 out 3047). Therefore, we decided to ignore this variable in our analysis.
BirthRate: It is not clear what exactly this represents. Often, the birth rate is defined as childbirths per 1,000 people per year, but applying that to this variable would not give us the right number. For example in Los Angeles County with the population of 10,170,292, there were 124,641 live births in 2015 based on official reports, which translates into a birth rate of 12.25 (BR = (b / p) X 1,000). However, the birth rate in our data shows a value of 4.7 for this county, which is probably the ratio of women aged 15-50 years old who gave birth in 2015 as reported by TownCharts). As a result, we decided to ignore this variable in our analysis.
#checking the BirthRate value for Los Angeles County
Cancer[1000,'BirthRate']
[1] 4.705281
#Calculating LA County birth rate based on official figures. Formula: BR = (b ÷ p) X 1,000
124641/10170292*1000
[1] 12.2554
deathRate: Based on our assessment, we believe this variable represents the number of deaths due to cancer per 100,000 population per county. For instance, we looked at the figure for Kings County, NY (173.6) and the number in our data is closer to the officially reported cancer death rate (140.3), as opposed to overall death rate (603.1). We also calculated the actual number of deaths per county (deathRate * popEst2015 / 100000) and the total for these values, which is eaual to 525,347. This number is pretty close to the figure reported by Cancer.gov (589,430), further confirming our assumption regarding deathRate.#checking the deathRate for Kings County, NY
Cancer[388, 'deathRate']
[1] 173.6
Kings County, NY statistics:
2015 population: 2,673,000
2015 death rate (per 100,000 population): 603.1
2015 Cancer death rate (per 100,000 population): 140.3
Sources: DATA USA, NY State Dpt of Health
#comparing total death count in our dataset with official stats reported by health authorities
Cancer$death_count <- Cancer$deathRate * Cancer$popEst2015/100000
sum(Cancer$death_count)
[1] 525347.7
PctEmpPrivCoverage: We assume that the values in this variable represent a subset of values in PctPrivateCoverage, since the sum of these two variables in some rows is above 100.
Overlap between PctPrivateCoverage and PctPublicCoverage: We assume that there is an overlap between people that have public health insurance and those with private health insurance, since the sum of PctPrivateCoverage and PctPublicCoverage in some rows is above 100. In fact, this is not uncommon among some senior citizenz that have both Medicare and a supplementary private health plan (aka. Medigap).
#adding up health insurance coverage variables, to makes sence of such variables and check for overlaps
Cancer$Pct_insured <- Cancer$PctPrivateCoverage + Cancer$PctPublicCoverage
Cancer$Pct_PersonalIsure <- Cancer$PctPrivateCoverage + Cancer$PctEmpPrivCoverage
print('Cancer$Pct_insured')
[1] "Cancer$Pct_insured"
summary(Cancer$Pct_insured)
Min. 1st Qu. Median Mean 3rd Qu. Max.
65.40 96.25 101.30 100.61 105.80 131.70
print('Cancer$Pct_PersonalIsure')
[1] "Cancer$Pct_PersonalIsure"
summary(Cancer$Pct_PersonalIsure)
Min. 1st Qu. Median Mean 3rd Qu. Max.
35.8 92.2 106.3 105.6 118.9 163.0
Based on the data evaluation mentiond before and additional analysis, we are transforming some of the valuables that have issues, as explained below.
#creating separarte County and State columns to enable state-wide analysis
Cancer <- Cancer %>% separate(Geography, c("County", "State"), sep = ", ", remove = FALSE)
#replacing erroneous values with NA
Cancer$MedianAge[Cancer$MedianAge > 100] <- NA
Cancer$AvgHouseholdSize[Cancer$AvgHouseholdSize < 1] <- NA
#Ramiro to add comment
bins <- seq(20000, 130000, by = 10000)
Cancer$binnedInc2 <- cut(Cancer$medIncome, breaks = bins)
# Cancer$avgAnnCount[Cancer$avgAnnCount == 1962.667684] <- NA
# Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000
#creating a new variable to represent actual number of deaths due to cancer per county
Cancer$death_count <- Cancer$deathRate * Cancer$popEst2015/100000
#creating a new variable to represent the percentage of all people that have health insurance
Cancer$Pct_insured <- Cancer$PctPrivateCoverage + Cancer$PctPublicCoverage
#Ramiro to add note
Cancer$Pct_PersonalIsure <- Cancer$PctPrivateCoverage + Cancer$PctEmpPrivCoverage
Even though the presentation of this section takes a linear form, the actual analysis of key variables was an iterative process. The key variables were chosen based on: * Our initial hypotheses regarding variables were potentially related to deathRate. * The possibility that a variable/factor could be changed through interventions implemented by government health agencies to improve cancer prevention and survival. * Additional analysis that we performed to identify variables that actually had a correlation with the dependent variable.
After selecting the key variables, our approach was to focus on assessing the quality of the data (as partly explained in the Introduction) and detecting features, through univariate analysis, that are important to include when modelling the relationships of interest, such as particular features in the distributions, unusual concentrations of observations around certain values, the presence of outliers and extreme outliers, among others.
Death rate’s distribution is symmetric and bell-shaped, with a small amount of outliers at both sides of the mean (2.1% of outliers, with 0.03% of extreme outliers). However, these outliers are still within a reasonable range and do not seem to be errors in the data. Furthermore, the observation corresponding to the only extreme outlier does not look atypical based on the values of the other variables.
Finally, using both summary metrics and visualizations, we did not find any unusual concentration of observations around specific values.
summary(Cancer$deathRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.7 161.2 178.1 178.7 195.2 362.8
boxHist(Cancer$deathRate, "The Cancer Death Rate", "Nummber of Deaths per 100K People")
outliers.summ(Cancer, 'deathRate')
## [1] "Outliers: 64 (2.1%)"
## [1] "Extreme outliers: 1 (0.03%)"
Cancer[Cancer$deathRate > 300, ]
The distribution of the incidence rate is unimodal and positively skewed, with 46 outliers and 1 extreme outlier. Since these values represent only 1.5% of observations and there is no furhter evidence that they are errors, they will be kept, but should be taken into account when modelling the relationship between incidence and death rates.
summary(Cancer$incidenceRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 140.3 480.8 550.7 550.7 621.6 1404.8
boxHist(Cancer$incidenceRate, "Cancer Incidence Rate", "New Diagnosed Cases per 100K People")
outliers.summ(Cancer, 'incidenceRate')
## [1] "Outliers: 77 (2.53%)"
## [1] "Extreme outliers: 1 (0.03%)"
There are two income variables available: binned income and median income. From these two, We chose median income as our key variable because it is more granular than binned income and, second, because the width of the binned income seem to have been defined to have a similar number of observations in each bin, which is not useful to observe its distribution, and the cutoffs chosen make the charts hard to read.
summary(Cancer$binnedInc)
## Length Class Mode
## 3047 character character
Below, we can see that the median income is inded a good candidate, since it doesn’t vary as much as income typically does (in this case, the difference between the minimum and maximum values is less than one order of magnitude), representing better the “average” member of each county. However, it’s distribution is positively sekewed, having 64 counties where the median income is higher than 80,000 USD.
summary(Cancer$medIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22640 38883 45207 47063 52492 125635
sum(Cancer$medIncome > 80000)
## [1] 64
boxHist(Cancer$medIncome, "The Median Income")
Including the 64 observations above that contribute to the positive skewness of this variable, there are still 122 outliers (around 4% of the total observations) that need to be taken into account when building the statistical model that captures the relationship between this variable and the death rate.
outliers.summ(Cancer, "medIncome")
## [1] "Outliers: 122 (4%)"
## [1] "Extreme outliers: 18 (0.59%)"
Given the rather large number of outliers in this variable, we could transform it by taking its logarithm. However, we have decided to follow the rule provided by Fox (2011), where logarithmic transformation is only likely to make a difference if its values “cover two or more orders of magnitude” (Fox, p. 128).
To measure education, we have six possible candidates: ‘PctNoHS18_24’, ‘PctHS18_24’, ‘PctSomeCol18_24’, ‘PctBachDeg18_24’, ‘PctHS25_Over’ and ‘PctBachDeg25_Over’ that can be divided in two groups: 18-24 and ‘25 and above’ years old. Our initial hypothesis is that the second group should have a stronger correlation with death rate. We validated this hypothesis with the correlations table shown below, that found that only PctBachDeg from the 18-24 group has a correlation with deathRate (although this correlation is very week, -0.31). Instead, as expected, the two’25 and above’ education variables have a much higher correlation with deathRate.
Therefore, we will focus on these two variables for further analyses on education.
cor(Cancer[, names(Cancer) %in%
c('PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24',
'PctHS25_Over', 'PctBachDeg25_Over', 'deathRate')], use = 'complete.obs')[7, ]
## PctNoHS18_24 PctHS18_24 PctSomeCol18_24 PctBachDeg18_24
## 0.1219703 0.2665730 -0.1886877 -0.3140130
## PctHS25_Over PctBachDeg25_Over deathRate
## 0.4182411 -0.4717962 1.0000000
We also validated that education variables within each group are mutually exclusive, by making sure that they add up to 100%, for all observations that have complete data, where we find that these variables indeed seem to be mutually exclusive, given that their range is between 99.9 and 100.1, where the small variations around 100 are likely due to rounding.
We can only test this with the 18-24 group since the 25_over group is missing two variables that capture ‘no high school’ and ‘some college’. However, it is reasonable to assume that the same definition is applied to our group of interest (25_over).
educ.18.24 <- c('PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24')
educ.df <- subset(Cancer, select = educ.18.24)
educ.complete <- complete.cases(educ.df)
sum.pct.freq <- data.frame(table(rowSums(educ.df[educ.complete, ], na.rm = TRUE)))
names(sum.pct.freq) <- c("Sum", "Frequency")
sum.pct.freq
Values in PctHS25 are within a reasonable range (7 to 55%) and there doesn’t seem to be an unusual concentration of observations around certain values. Also, the disribution of this variable is unimodal and negatively skewed. However, it only contains 31 outliers (1% of observations) and there are no extreme outliers. Furthermore, there are no indications that these outliers are errors, so we decided to keep them.
summary(Cancer$PctHS25_Over)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.50 30.40 35.30 34.80 39.65 54.80
boxHist(Cancer$PctHS25_Over, "Percentage age 25 or older with high school only")
outliers.summ(Cancer, "PctHS25_Over")
## [1] "Outliers: 31 (1.02%)"
## [1] "Extreme outliers: 0 (0%)"
Values in PctHS25_Over are within a reasonable range (7% to 55%) and there doesn’t seem to be an unusual concentration of observations around certain values. The disribution of this variable is unimodal and positively skewed. It contains 82 outliers (2.7% of observations) all of which at are at the right side of the mean. Of these 82 outliers, only 5 are extreme outliers, that will be kept in the data set, since there are no indications that they are errors.
summary(Cancer$PctBachDeg25_Over)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.50 9.40 12.30 13.28 16.10 42.20
boxHist(Cancer$PctBachDeg25_Over, "Percentage age 25 or older with bachelors degree only")
Extreme outliers
outliers.summ(Cancer, "PctBachDeg25_Over")
## [1] "Outliers: 82 (2.69%)"
## [1] "Extreme outliers: 5 (0.16%)"
The distribution of povertyPercent is unimodal and positively skewed. This is reflected by the fact that all outliers are at the right of the mean. Taking a deeper dive into the outliers, we found that only 3 are extreme while 66 are mild. For this reason, and because we did not find other indication that the outliers or other values were errors, we will keep all data from this variable.
However, when modeling the relationship of interest, we should take into account that the distribution of this variable is not normal and it may be necessary to transform it if the model used requires it.
summary(Cancer$povertyPercent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.20 12.15 15.90 16.88 20.40 47.40
boxHist(Cancer$povertyPercent, "Percentage age 25 or older with up to bachelors degree")
Extreme outliers
outliers.summ(Cancer, "povertyPercent")
## [1] "Outliers: 69 (2.26%)"
## [1] "Extreme outliers: 3 (0.1%)"
The distribution of PctEmployed16_Over is unimodal and negatively skewed. This is reflected by the fact that all but one of the outliers are at the left of the mean. There are no extreme outliers and 20 mild outliers (0.7% of observations). For this reason, and because we did not find other indication that the outliers or other values were errors, we will keep all data from this variable.
summary(Cancer$PctEmployed16_Over)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 17.60 48.60 54.50 54.15 60.30 80.10 152
boxHist(Cancer$PctEmployed16_Over, "Percentage age 25 or older with up to bachelors degree")
## Warning: Removed 152 rows containing non-finite values (stat_boxplot).
## Warning: Removed 152 rows containing missing values (geom_point).
## Warning: Removed 152 rows containing non-finite values (stat_bin).
## Warning: Removed 152 rows containing non-finite values (stat_bin).
Extreme outliers
outliers.summ(Cancer, "PctEmployed16_Over")
## [1] "Outliers: 20 (0.66%)"
## [1] "Extreme outliers: 0 (0%)"
The distribution of PctPublicCoverage is unimodal and symmetric, with no extreme outliers and only 18 mild outliers (0.6% of observations). For this reason, and because we did not find other indication that the outliers or other values were errors, we will keep all data from this variable. There are also no other particular features from this variables that grant further warnings in modelling the relationship between it and deathRate.
summary(Cancer$PctPublicCoverage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.20 30.90 36.30 36.25 41.55 65.10
boxHist(Cancer$PctPublicCoverage, "Percentage age 25 or older with up to bachelors degree")
Extreme outliers
outliers.summ(Cancer, "PctPublicCoverage")
## [1] "Outliers: 18 (0.59%)"
## [1] "Extreme outliers: 0 (0%)"
As explained above, guided by our hypothesis that the education of the ‘25 and over’ years old group should have a much stronger relationship with deathRate than the ‘18-24’ years old group, which was supported by the correlations between these variables, we will be focusing on the former group.
A correlation of \(0.4\) between PctHS25_over and deathRate indicates that there is indeed a relationship between these variables, which is further indicated by plotting them together in a scatterplot, that shows that higher values of percentage of population with only high school tend to be associated to higher death rates (this is also reflected in the regression line added to the scatterplot).
This is an intuitive result since it indicates that a higher concentration of people with low education levels may have poorer health habits and lower access to medical services. However, both of these variables could be affeccted by MedianAge in the same direction: older counties might have lower levels of higher education and higher rates of death.
cor(Cancer$deathRate, Cancer$PctHS25_Over)
## [1] 0.4045891
#plot(Cancer$PctHS25_Over, Cancer$deathRate, main = "HS (>24)")
#abline(lm(Cancer$deathRate ~ Cancer$PctHS25_Over), lty = 'dashed', lwd = 2, col = 'red')
yxScatter(Cancer$deathRate, Cancer$PctHS25_Over, "HS Education (>24)")
## Warning in if (is.na(x)) {: the condition has length > 1 and only the first
## element will be used
## Warning in if (is.na(x)) {: the condition has length > 1 and only the first
## element will be used
A correlation of \(-0.48\) indicates that there is relationship between PctBachDeg25_over and deathRate, which is further supported by plotting these variables in a scatterplot, where it can be seen that higher values of percentage of people with bachelors degree are associated to lower levels of death rates. This relationship is also supported by the regression line included in the scatterplot.
This is also an intuitive result, since higher levels of education might be linked to better health habits and access to health services. However, and following the same reasoning than PctHS25_over, the relationship between these two variables may be confounded by MedianAge, although it is not clear in which direction this effect might go. Therefore, it will also be necessary to explore the effect of MedianAge in the following section.
cor(Cancer$deathRate, Cancer$PctBachDeg25_Over)
## [1] -0.4854773
#plot(Cancer$PctBachDeg25_Over, Cancer$deathRate, main = "Bachelor (>24)")
#abline(lm(Cancer$deathRate ~ Cancer$PctBachDeg25_Over), lty = 'dashed', lwd = 2, col = 'red')
yxScatter(Cancer$PctBachDeg25_Over, Cancer$deathRate, "Bachelor's Degree (>24)")
## Warning in if (is.na(x)) {: the condition has length > 1 and only the first
## element will be used
## Warning in if (is.na(x)) {: the condition has length > 1 and only the first
## element will be used
Throughout the analyses above, we began to identify that some of the relationships found between deathRate and other variables may not only be capturing the direct relationship betwee these variables but of additional variable(s) that may be impacting both. To further assess this systematically, the following network visualization shows the variables that have a correlation higher than 0.4, where each node represents a different variable and each vertex indicates the strength of the relationship between the variables connected.
PercentMarried has a (weak) relation and AvgHouseholdSize has a moderate relation with MedianAge. Based on these results, we explored this relationship further.
cor(subset(Cancer,
select = c("MedianAge", "AvgHouseholdSize", "PercentMarried")),
use = "pairwise.complete.obs")[1, ]
## MedianAge AvgHouseholdSize PercentMarried
## 1.0000000 -0.6138722 0.4290795
Both the scatterplots below and the regression lines imposed on them provide further support that there is indeed a relationship between these two variabes and MedianAge, indicating that MedianAge may confound the relationship between these two and deathRate. Therefore, this should be taken into account when modelling the relation of interest, in order to isolate the effect of the family variables on the death rate.
#plot(Cancer$MedianAge, Cancer$PercentMarried, main = "Age vs PercentMarried")
#abline(lm(PercentMarried ~ MedianAge, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
yxScatter(Cancer$PercentMarried, Cancer$MedianAge, "Age vs PercentMarried")
## Warning in if (is.na(x)) {: the condition has length > 1 and only the first
## element will be used
## Warning in if (is.na(x)) {: the condition has length > 1 and only the first
## element will be used
#plot(Cancer$MedianAge, Cancer$AvgHouseholdSize, main = "Age vs Average household size")
#abline(lm(AvgHouseholdSize ~ MedianAge, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
yxScatter(Cancer$AvgHouseholdSize, Cancer$MedianAge, "Age vs Household Size")
## Warning in if (is.na(x)) {: the condition has length > 1 and only the first
## element will be used
## Warning in if (is.na(x)) {: the condition has length > 1 and only the first
## element will be used
Correlation analysis shows that there is a relationship between the percentage of black population and employment, which is further confirmed both by a visual inspection of the scatterplot and the linear regression line charted in this plot. Since employment es related to deathRate, its correlation with PctBlack may indicate that this variable may be confounding the relationship of interest and thus further modeling needs to take this into account, to isolate the effect of unemployment on death rate.
cor(subset(Cancer, select = c("PctBlack", "PctUnemployed16_Over")),
use = "pairwise.complete.obs")[, 1]
## PctBlack PctUnemployed16_Over
## 1.0000000 0.4692731
plot(Cancer$PctBlack, Cancer$PctUnemployed16_Over, main = "Age vs PercentMarried")
abline(lm(PctUnemployed16_Over ~ PctBlack, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
A boxplot containing different location measures of deathRate by State shows that these values vary measures vary significantly across state. Since State may be capturing several state-level characteristics that may in turn affect other variables that have a relation with deathRate, it is recommended to include state-level effects when modeling the relation of interest, to control for confounding these state-level factors.
boxplot(Cancer$deathRate ~ Cancer$State)